Hierarchical clustering can be divided into two main types: agglomerative and divisive.
Agglomerative clustering: It’s also known as AGNES (Agglomerative Nesting). It works in a bottom-up manner. That is, each object is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new bigger cluster (nodes). This procedure is iterated until all points are member of just one single big cluster (root) (see figure below). The result is a tree which can be plotted as a dendrogram.
Divisive hierarchical clustering: It’s also known as DIANA (Divise Analysis) and it works in a top-down manner. The algorithm is an inverse order of AGNES. It begins with the root, in which all objects are included in a single cluster. At each step of iteration, the most heterogeneous cluster is divided into two. The process is iterated until all objects are in their own cluster (see figure below).
Note that agglomerative clustering is good at identifying small clusters. Divisive hierarchical clustering is good at identifying large clusters.
The merging or the division of clusters is performed according some (dis)similarity measure. In R softwrare, the Euclidean distance is used by default to measure the dissimilarity between each pair of observations.
As we already know, it’s easy to compute dissimilarity measure between two pairs of observations. It’s mentioned above that two clusters that are most similar are fused into a new big cluster.
A natural question is : How to measure the dissimilarity between two clusters of observations?
The most common types methods are:
Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.
Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.
Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance are merged.
Complete linkage and Ward’s method are generally preferred.
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
completelinkage <- hclust(dist(scale(USArrests)), 'complete')
plot(completelinkage)
# scale is generic function whose default method centers and/or scales the columns of a numeric matrix.
singlelinkage <- hclust(dist(scale(USArrests)), 'single')
plot(singlelinkage)
averagelinkage <- hclust(dist(scale(USArrests)), 'average')
plot(averagelinkage)
wardlinkage <- hclust(dist(scale(USArrests)), 'ward.D2')
plot(wardlinkage)
data('USArrests')
dataclear <- na.omit(USArrests)
head(dataclear, n=6)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
desc_stats <- data.frame(
Min = apply(dataclear, 2, min), # minimum
Med = apply(dataclear, 2, median), # median
Mean = apply(dataclear, 2, mean), # mean
SD = apply(dataclear, 2, sd), # Standard deviation
Max = apply(dataclear, 2, max) # Maximum
)
desc_stats <- round(desc_stats, 1)
head(desc_stats)
## Min Med Mean SD Max
## Murder 0.8 7.2 7.8 4.4 17.4
## Assault 45.0 159.0 170.8 83.3 337.0
## UrbanPop 32.0 66.0 65.5 14.5 91.0
## Rape 7.3 20.1 21.2 9.4 46.0
the variables have a large different means and variances. This is explained by the fact that the variables are measured in different units; Murder, Rape, and Assault are measured as the number of occurrences per 100 000 people, and UrbanPop is the percentage of the state’s population that lives in an urban area.
They must be standardized (i.e., scaled) to make them comparable. Recall that, standardization consists of transforming the variables such that they have mean zero and standard deviation one.
scaled_stats <- scale(dataclear)
head(dataclear)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
There are different functions available in R for computing hierarchical clustering. The commonly used functions are:
hclust() [in stats package] and agnes() [in cluster package] for agglomerative hierarchical clustering (HC) diana() [in cluster package] for divisive HC
The simplified format is:
hclust(d, method = “complete”)
d => a dissimilarity structure as produced by the dist() function.
method => The agglomeration method to be used. Allowed values is one of “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”.
# Dissimilarity matrix
d <- dist(scaled_stats, method = "euclidean")
# Hierarchical clustering using Ward's method
res.hc <- hclust(d, method = "ward.D2" )
# Plot the obtained dendrogram
plot(res.hc, cex = 0.6, hang = -1)
The R function agnes() (in cluster package) can be also used to compute agglomerative hierarchical clustering. The R function diana() (in cluster package) is an example of divisive hierarchical clustering.
Agglomerative Nesting (Hierarchical Clustering) agnes(x, metric = “euclidean”, stand = FALSE, method = “average”)
DIvisive ANAlysis Clustering diana(x, metric = “euclidean”, stand = FALSE)
x: data matrix or data frame or dissimilarity matrix. In case of matrix and data frame, rows are observations and columns are variables. In case of a dissimilarity matrix, x is typically the output of daisy() or dist(). metric: the metric to be used for calculating dissimilarities between observations. Possible values are “euclidean” and “manhattan”. stand: if TRUE, then the measurements in x are standardized before calculating the dissimilarities. Measurements are standardized for each variable (column), by subtracting the variable’s mean value and dividing by the variable’s mean absolute deviation method: The clustering method. Possible values includes “average”, “single”, “complete”, “ward”.
The function agnes() returns an object of class “agnes” which has methods for the functions: print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().
The function diana() returns an object of class “diana” which has also methods for the functions: print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().
Compared to other agglomerative clustering methods such as hclust(), agnes() has the following features:
It yields the agglomerative coefficient (see agnes.object) which measures the amount of clustering structure found Apart from the usual tree it also provides the banner, a novel graphical display (see plot.agnes).
library("cluster")
# Compute agnes()
res.agnes <- agnes(scaled_stats, method = "ward")
# Agglomerative coefficient
res.agnes$ac
## [1] 0.934621
## [1] 0.934621
# Plot the tree using pltree()
pltree(res.agnes, cex = 0.6, hang = -1,
main = "Dendrogram of agnes")
# plot.hclust()
plot(as.hclust(res.agnes), cex = 0.6, hang = -1)
# plot.dendrogram()
plot(as.dendrogram(res.agnes), cex = 0.6,
horiz = TRUE)
# Compute diana()
res.diana <- diana(scaled_stats)
# Divise coefficient
res.diana$dc
## [1] 0.8514345
# Plot the tree
pltree(res.diana, cex = 0.6, hang = -1,
main = "Dendrogram of diana")
# plot.hclust()
plot(as.hclust(res.diana), cex = 0.6, hang = -1)
# plot.dendrogram()
plot(as.dendrogram(res.diana), cex = 0.6,
horiz = TRUE)
In the dendrogram displayed above, each leaf corresponds to one observation. As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher height.
The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The higher the height of the fusion, the less similar the observations are.
Note that, conclusions about the proximity of two observations can be drawn only based on the height where branches containing those two observations first are fused. We cannot use the proximity of two observations along the horizontal axis as a criteria of their similarity.
In order to identify sub-groups (i.e. clusters), we can cut the dendrogram at a certain height as described in the next section.
The height of the cut to the dendrogram controls the number of clusters obtained. It plays the same role as the k in k-means clustering.
The function cutree() is used and it returns a vector containing the cluster number of each observation:
# Cut tree into 4 groups
grp <- cutree(res.hc, k = 4)
# Number of members in each cluster
table(grp)
## grp
## 1 2 3 4
## 7 12 19 12
# Get the names for the members of cluster 1
rownames(scaled_stats)[grp == 1]
## [1] "Alabama" "Georgia" "Louisiana" "Mississippi"
## [5] "North Carolina" "South Carolina" "Tennessee"
## [1] "Alabama" "Georgia" "Louisiana" "Mississippi"
## [5] "North Carolina" "South Carolina" "Tennessee"
It’s also possible to draw the dendrogram with a border around the 4 clusters. The argument border is used to specify the border colors for the rectangles:
plot(res.hc, cex = 0.6)
rect.hclust(res.hc, k = 4, border = 2:5)
Using the function fviz_cluster() [in factoextra], we can also visualize the result in a scatter plot. Observations are represented by points in the plot, using principal components. A frame is drawn around each cluster.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_cluster(list(data=scaled_stats, cluster = grp))
The function cutree() can be used also to cut the tree generated with agnes() and diana() as follow:
# Cut agnes() tree into 4 groups
cutree(res.agnes, k = 4)
## [1] 1 2 2 3 2 2 3 3 2 1 3 4 2 3 4 3 3 1 4 2 3 2 4 1 3 4 4 2 4 3 2 2 1 4 3
## [36] 3 3 3 3 1 4 1 2 3 4 3 3 4 4 3
# Cut diana() tree into 4 groups
cutree(as.hclust(res.diana), k = 4)
## Alabama Alaska Arizona Arkansas California
## 1 2 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 4 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
Package dendextend which contains many functions for comparing two dendrograms, including: dend_diff(), tanglegram(), entanglement(), all.equal.dendrogram(), cor.dendlist().
A random subset of the dataset will be used in the following example. The function sample() is used to randomly select 10 observations among the 50 observations contained in the data set
# Subset containing 10 rows
set.seed(123)
ss <- sample(1:50, 10)
df <- data.frame(ss)
In the R code below, we’ll start by computing pairwise distance matrix using the function dist(). Next, hierarchical clustering (HC) is computed using two different linkage methods (“average” and “ward.D2”). Finally the results of HC are transformed as dendrograms:
# Compute distance matrix
res.dist <- dist(df, method = "euclidean")
# Compute 2 hierarchical clusterings
hc1 <- hclust(res.dist, method = "average")
hc2 <- hclust(res.dist, method = "ward.D2")
dend1 <- as.dendrogram (hc1)
dend2 <- as.dendrogram (hc2)
library(dendextend)
##
## ---------------------
## Welcome to dendextend version 1.8.0
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
dend_list <- dendlist(dend1, dend2)
The function tanglegram() plots two dendrograms, side by side, with their labels connected by lines. It can be used for visually comparing two methods of Hierarchical clustering as follow:
tanglegram(dend1, dend2)
Note that “unique” nodes, with a combination of labels/items not present in the other tree, are highlighted with dashed lines.
The quality of the alignment of the two trees can be measured using the function entanglement().
tanglegram(dend1, dend2)
tanglegram(dend1, dend2,
highlight_distinct_edges = FALSE, # Turn-off dashed lines
common_subtrees_color_lines = FALSE, # Turn-off line colors
common_subtrees_color_branches = TRUE, # Color common branches
main = paste("entanglement =", round(entanglement(dend_list), 2))
)
# Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement). A lower entanglement coefficient corresponds to a good alignment
The function cor.dendlist() is used to compute “Baker” or “Cophenetic” correlation matrix between a list of trees.
# Cophenetic correlation matrix
cor.dendlist(dend_list, method = "cophenetic")
## [,1] [,2]
## [1,] 1.0000000 0.9857437
## [2,] 0.9857437 1.0000000
# Baker correlation matrix
cor.dendlist(dend_list, method = "baker")
## [,1] [,2]
## [1,] 1.0000000 0.9677821
## [2,] 0.9677821 1.0000000
The correlation between two trees can be also computed as follow:
# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)
## [1] 0.9857437
# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.9677821
It’s also possible to compare simultaneously multiple dendrograms. A chaining operator %>% (available in dendextend) is used to run multiple function at the same time. It’s useful for simplifying the code:
# Subset data
set.seed(123)
ss <- sample(1:150, 10 )
# Create multiple dendrograms by chaining
dend1 <- df %>% dist %>% hclust("com") %>% as.dendrogram
dend2 <- df %>% dist %>% hclust("single") %>% as.dendrogram
dend3 <- df %>% dist %>% hclust("ave") %>% as.dendrogram
dend4 <- df %>% dist %>% hclust("centroid") %>% as.dendrogram
# Compute correlation matrix
dend_list <- dendlist("Complete" = dend1, "Single" = dend2,
"Average" = dend3, "Centroid" = dend4)
cors <- cor.dendlist(dend_list)
# Print correlation matrix
round(cors, 2)
## Complete Single Average Centroid
## Complete 1.00 0.97 0.99 0.98
## Single 0.97 1.00 0.99 0.99
## Average 0.99 0.99 1.00 1.00
## Centroid 0.98 0.99 1.00 1.00
# Visualize the correlation matrix using corrplot package
library(corrplot)
## corrplot 0.84 loaded
corrplot(cors, "pie", "lower")
Hierarchical clustering is an approach which builds a hierarchy from the bottom-up, and doesn’t require to specify the number of clusters beforehand.
This is usually represented by dendogram.
We can also determine how close two clusters are: - Complete linkage clustering: Find the maximum possible distance between points belonging to two different clusters - Single linkage clustering: Find the minimum possible distance between points belonging to two different clusters - Mean linkage clustering: Find all possible pairwise distances for points belonging to two different clusters and then calculate the average - Centroid linkage clustering: Find the centroid of each cluster and calculate the distance between centroids of two clusters
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# There are 3 species of iris flowers.
In order to perform hierarchical clustering, ‘hclust’ will be used. It requires to provide the data in the form of a distance matrix. This can be done by using ‘dist’.
clusters <- hclust(dist(iris[,3:4]))
plot(clusters)
# iris[,3:4] => return petal.length and petal.width
This generated a dendogram. Now we can decide on number of clusters. In this example, it could be 3 or 4. To cut number of clusters, ‘cutree’ will be used.
clustercut<- cutree(clusters,3)
Let’s compare this with original data
table(clustercut, iris$Species)
##
## clustercut setosa versicolor virginica
## 1 50 0 0
## 2 0 21 50
## 3 0 29 0
setosa and virginica were successfuly classified into clusters. Versicolor was split into two clusters. In the graph below, I can see why.
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = iris$Species)) +
geom_point()
Other linkage method for clustering can be used e.g. mean linkage method:
clusters <- hclust(dist(iris[,3:4]), method = 'average')
plot(clusters)
clusterCut <- cutree(clusters, 3)
table(clusterCut, iris$Species)
##
## clusterCut setosa versicolor virginica
## 1 50 0 0
## 2 0 45 1
## 3 0 5 49
Algorithm did a much better job of clustering the data, only went wrong with 6 of the data points.
We can execute similar approaches for hierarchical clustering as with k-means clustering
To perform the elbow method we just need to change the second argument in fviz_nbclust to FUN=hcut
I will used previously defined ‘scale_stats’
fviz_nbclust(scaled_stats, FUN = hcut, method = "wss")
To perform the average silhouette method we follow a similar process.
fviz_nbclust(scaled_stats, FUN=hcut, method = "silhouette")
gap_stat <- clusGap(scaled_stats, FUN = hcut, nstart = 25, K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
By default, heatmap clusters by both rows and columns. It then reorders the resulting dendrograms according to mean. Setting Colv to false tells it not to reorder the columns, which will come in handy later. Let’s also turn off the default scaling across rows. We’ve already scaled across columns, which is the sensible thing to do in this case.
# cluster rows
hc.rows <- hclust(dist(scaled_stats))
plot(hc.rows)
# transpose the matrix and cluster columns
hc.cols <- hclust(dist(t(scaled_stats)))
# heatmap
heatmap(scaled_stats, Colv=F, scale='none')
# draw heatmap for first cluster, k determines number of cluster (if 2, dendogram is divided into 2 clusters and zoom-in)
heatmap(scaled_stats[cutree(hc.rows,k=2)==1,], Colv=as.dendrogram(hc.cols), scale='none')
# draw heatmap for second cluster
heatmap(scaled_stats[cutree(hc.rows,k=2)==2,], Colv=as.dendrogram(hc.cols), scale='none')